Data Manipulation with Pandas


In [ ]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)

Categorical Types

  • Pandas provides a convenient dtype for representing categorical, or factor, data

In [ ]:
c = pd.Categorical(['a', 'b', 'b', 'c', 'a', 'b', 'a', 'a', 'a', 'a'])

In [ ]:
c

In [ ]:
c.describe()

In [ ]:
c.codes

In [ ]:
c.categories
  • By default the Categorical type represents an unordered categorical
  • You can provide information about the order of categories

In [ ]:
c.as_ordered()
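As a minimal sketch of providing the order up front (using hypothetical size data, not the `c` variable above), you can pass `categories=` and `ordered=True` directly to the constructor; ordered categoricals then support comparisons and `min`/`max`:

```python
import pandas as pd

# Hypothetical ordered Categorical: categories= fixes the set of levels,
# ordered=True gives them a meaningful order.
sizes = pd.Categorical(['small', 'large', 'medium', 'small'],
                       categories=['small', 'medium', 'large'],
                       ordered=True)

print(sizes.min())      # smallest category present
print(sizes > 'small')  # elementwise comparison against a category
```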

Support in DataFrames

  • When a Categorical is in a DataFrame, there is a special cat accessor
  • This gives access to all of the features of the Categorical type

In [ ]:
dta = pd.DataFrame.from_dict({'factor': c,
                              'x': np.random.randn(10)})

In [ ]:
dta.head()

In [ ]:
dta.dtypes

In [ ]:
dta.factor.cat

In [ ]:
dta.factor.cat.categories

In [ ]:
dta.factor.describe()

Exercise

  • Load NFS data again. Convert fditemno to a Categorical Type. Use describe.

In [ ]:
# [Solution Here]

In [ ]:
%load solutions/load_nfs_categorical.py

Date and Time Types

Pandas provides conveniences for working with dates

Creating a Range of Dates


In [ ]:
dates = pd.date_range("1/1/2015", periods=75, freq="D")
dates

In [ ]:
y = pd.Series(np.random.randn(75), index=dates)
y.head()

In [ ]:
y.reset_index().dtypes

Support in DataFrames

  • When a datetime type is in a DataFrame, there is a special dt accessor
  • This gives access to all of the features of the datetime type

In [ ]:
dta = (y.reset_index(name='y').
       rename(columns={'index': 't'}))

In [ ]:
dta.head()

In [ ]:
dta.dtypes

In [ ]:
dta.t.dt.freq

In [ ]:
dta.t.dt.day

Indexing with Dates

  • You can use strings
  • Note: the ending index is inclusive here. This is different than most of the rest of Python

In [ ]:
y.loc["2015-01-01":"2015-01-15"]

DatetimeIndex supports partial string indexing


In [ ]:
y["2015-01"]
  • You can resample to a lower frequency, specifying how to aggregate
  • Uses the DatetimeIndexResampler object

In [ ]:
resample = y.resample("M")

In [ ]:
resample.mean()

Or go to a higher frequency, optionally specifying how to fill in the missing values


In [ ]:
y.asfreq('H', method='ffill')

There are convenience methods to lag and lead time series


In [ ]:
y

In [ ]:
y.shift(1)

In [ ]:
y.shift(-1)

Rolling and Window Functions

  • Pandas also provides a number of convenience functions for working on rolling or moving windows of time series through a common interface
  • This interface is the new Rolling object

In [ ]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', 
                                                          periods=1000))
ts = ts.cumsum()

In [ ]:
rolling = ts.rolling(window=60)
rolling

In [ ]:
rolling.mean()

Exercise

  • Create a datetime column named 'date' for the NFS_1974.csv diary data
    • styr: Survey year
    • stmth: Survey month
    • logday: Day in the log (assume logdays are actual days)
  • Hint: You could do this in two ways
    • Look at the parse_dates keyword of read_csv
    • Create the date after reading in the DataFrame

In [ ]:
# [Solution here]

In [ ]:
%load solutions/load_nfs_datetime.py

Merging and Joining DataFrames


In [ ]:
# this is a bit slow because of the date parsing

transit = pd.read_csv("../data/AIS/transit_segments.csv", 
                      parse_dates=['st_time', 'end_time'],
                      infer_datetime_format=True)
vessels = pd.read_csv("../data/AIS/vessel_information.csv")
  • A lot of the time data that comes from relational databases will be normalized
  • I.e., redundant information will be put in separate tables
  • Users are expected to merge or join tables to work with them

In [ ]:
vessels.head()

In [ ]:
transit.head()
  • Several ships in the vessels data have traveled multiple segments, as we would expect
  • Matching the names in the transit data to the vessels data is thus a many-to-one match
  • Aside: pandas Indices (of which Columns are one) are set-like

In [ ]:
vessels.columns.intersection(transit.columns)

Merging

  • We can combine these two datasets for a many-to-one match
  • merge will use the common columns if we do not explicitly specify the columns

In [ ]:
transit.merge(vessels).head()

Watch out, when merging on columns, indices are discarded


In [ ]:
A = pd.DataFrame(np.random.randn(25, 2), 
                 index=pd.date_range('1/1/2015', periods=25))
A[2] = np.repeat(list('abcde'), 5)
A

In [ ]:
B = pd.DataFrame(np.random.randn(5, 2))
B[2] = list('abcde')
B

In [ ]:
A.merge(B, on=2)

Joins

  • Join is like merge, but it works on the indices
  • The same could be achieved with merge and the left_index and right_index keywords

In [ ]:
transit.set_index('mmsi', inplace=True)
vessels.set_index('mmsi', inplace=True)

In [ ]:
transit.join(vessels).head()
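The equivalence between join and merge-on-index noted above can be sketched with small hypothetical frames (the `left`/`right` names are illustrative, not from the AIS data):

```python
import pandas as pd

# Hypothetical frames sharing index labels
left = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
right = pd.DataFrame({'b': [10, 20]}, index=['x', 'z'])

joined = left.join(right)  # left join on the indices by default
merged = left.merge(right, left_index=True, right_index=True,
                    how='left')  # the same result via merge

print(joined.equals(merged))  # True
```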

Exercise

  • Join the 1974 Household NFS data with the Diary data
  • The data is in ../data/NationalFoodSurvey/NFS_1974/

In [ ]:
%load solutions/join_nfs.py

Concatenation

  • Another common operation is appending data row-wise or column-wise to an existing dataset
  • We can use the concat function for this
  • Let's import two Ebola situation reports from Guinea, each a daily snapshot of case counts
  • We will use the Date and Description columns of each dataset as a hierarchical index

In [ ]:
df1 = pd.read_csv('../data/ebola/guinea_data/2014-08-04.csv', 
                  index_col=['Date', 'Description'])
df2 = pd.read_csv('../data/ebola/guinea_data/2014-08-26.csv',
                 index_col=['Date', 'Description'])

In [ ]:
df1.shape

In [ ]:
df2.shape

In [ ]:
df1.head()

In [ ]:
df2.head()

In [ ]:
df1.index.is_unique

In [ ]:
df2.index.is_unique

We can concatenate on the rows


In [ ]:
df = pd.concat((df1, df2), axis=0)
df.shape
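Column-wise concatenation works the same way with `axis=1`, aligning on the index; a minimal sketch with hypothetical frames:

```python
import pandas as pd

# Hypothetical frames sharing row labels
left = pd.DataFrame({'a': [1, 2]}, index=['r1', 'r2'])
right = pd.DataFrame({'b': [3, 4]}, index=['r1', 'r2'])

# axis=1 aligns on the index and stacks the columns side by side
wide = pd.concat((left, right), axis=1)
print(wide.shape)  # (2, 2)
```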

Exercise

  • Join all of the diary data together in a single DataFrame
    • Hint: you might find glob.glob useful
  • You will need to add a unique field identifying the survey year to each DataFrame
    • Hint: you might find a regular expression using re.search useful

In [ ]:
# [Solution here]

In [ ]:
%load solutions/concat_nfs.py

Text Data Manipulation

  • Much like the cat and dt accessors we've already seen
  • String types have a str accessor that provides fast string operations on columns

In [ ]:
vessels.type
  • Count the '/' separators in the vessel type strings

In [ ]:
vessels.type.str.count('/').max()
  • Split on the separator with expand=True to return a DataFrame with NaN padding

In [ ]:
vessels.type.str.split('/', expand=True)
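Beyond count and split, the str accessor vectorizes most common string operations; a small sketch on a hypothetical Series (not the vessels data):

```python
import pandas as pd

# Hypothetical vessel-type strings
s = pd.Series(['Towing/Tug', 'Cargo', 'Fishing/Trawler'])

print(s.str.lower())                           # lowercase every element
print(s.str.len())                             # string length per element
print(s.str.replace('/', ' & ', regex=False))  # literal substitution
```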

Exercise

  • Load the file "Ref_ food groups.txt"
  • Get all of the food groups that contain the word milk

In [ ]:
# [Solution here]

In [ ]:
%load solutions/nfs_dairy.py